Abstract. A valuation for a player in a game in extensive form 
is an assignment of numeric values to the player's moves. The valuation 
reflects the desirability of the moves. We assume a myopic player, 
who chooses a move with the highest valuation. Valuations can also 
be revised, and hopefully improved, after each play of the game. 
Here, a very simple valuation revision is considered, in which the 
moves made in a play are assigned the payoff obtained in the play. 
We show that by adopting such a learning process a player who 
has a winning strategy in a win-lose game can almost surely guar- 
antee a win in a repeated game. When a player has more than two 
payoffs, a more elaborate learning procedure is required. We con- 
sider one that associates with each move the average payoff in the 
rounds in which this move was made. When all players adopt this 
learning procedure, with some perturbations, then, with probabil- 
ity 1, strategies that are close to subgame perfect equilibrium are 
played after some time. A single player who adopts this procedure 
can guarantee only her individually rational payoff. 



1. Introduction 

Models of learning in games fall roughly into two categories. In the 
first, the learning player forms beliefs about the future behavior of 
other players and nature, and directs her behavior according to these 
beliefs. We refer to these as fictitious-play-like models. In the second, 
the player is attuned only to her own performance in the game, and 
uses it to improve future performance. These are called models of 
reinforcement learning. 

Reinforcement learning has been used extensively in artificial 
intelligence (AI). Samuel wrote a checkers-playing learning program as far 
back as 1955, which marks the beginning of reinforcement learning 
(see Samuel (1959)). Since then many other sophisticated algorithms, 
heuristics, and computer programs based on reinforcement learning have 
been developed (Sutton and Barto (1998)). Such programs try neither to 
learn the behavior of a specific opponent, nor to find the distribution 
of opponents' behavior in the population. Instead, they learn how to 
improve their play from the achievements of past behavior. 

Until recently, game theorists studied mostly fictitious-play-like 
models. Reinforcement learning has only attracted the attention of 
game theorists in the last decade in theoretical works like Gilboa and 
Schmeidler (1995), Camerer and Ho (1997), Sarin and Vahid (1999) 
and in experimental works like Erev and Roth (1997). In all these 
studies the basic model is given in a strategic form, and the learning 
player identifies those of her strategies that perform better. This ap- 
proach seems inadequate where learning of games in extensive form is 
concerned. Except for the simplest games in extensive form, the size of 
the strategy space is so large that learning, by human beings or even 
machines, cannot involve the set of all strategies. This is certainly true 
for the game of chess, where the number of strategies exceeds the num- 
ber of particles in the universe. But even a simple game like tic-tac-toe 
is not perceived by human players in the full extent of its strategic 
form. 

The process of learning games in extensive form can involve only a 
relatively small number of simple strategies. But when the strategic 
form is the basic model, no subset of strategies can be singled out. 
Thus, for games in extensive form the structure of the game tree should 
be taken into consideration. Instead of strategies being reinforced, as 
for games in strategic form, it is the moves of the game that should be 
reinforced for games in extensive form. 

This, indeed, is the approach of heuristics for playing games which 
were developed by AI theorists.^1 One of the most common building 
blocks of such heuristics is the valuation, which is a real-valued function 
on the possible moves of the learning player. The valuation of a move 
reflects, very roughly, the desirability of the move. Given a valuation, 
a learning process can be defined by specifying two rules: 

• A strategy rule, which specifies how the game is played for any 
given valuation of the player; 

• A revision rule, which specifies how the valuation is revised after 
playing the game. 



1 Perhaps the concentration of the AI literature on moves rather than strategies 
is the reason why there seems to be almost no overlap between two major books 
on learning, each in its field: The Theory of Learning in Games, Fudenberg and 
Levine (1998) and Reinforcement Learning: An Introduction, Sutton and Barto 
(1998). 
Our purpose here is to study learning-by-valuation processes, based 
on simple strategy and revision rules. In particular, we want to 
demonstrate the convergence properties of these processes in repeated games, 
where the stage game is given in an extensive form with perfect 
information and any number of players. Convergence results of the type we 
prove here are very common in the literature of game theory. But as 
noted before, convergence of reinforcement is limited in this literature 
to strategies rather than moves.^2 To the best of our knowledge, the 
AI literature, while describing dynamic processes closely related to the 
ones we study here, does not prove convergence results of this type. 

First, we study stage games in which the learning player has only 
two payoffs, 1 (win) and 0 (lose). Two-person win-lose games are a 
special case. But here, there is no restriction on the number of the 
other players or their payoffs. 

For these games we adopt the simple myopic strategy rule. By this 
rule, the player chooses at each of her decision nodes a move which has 
the highest valuation among the moves available to her at this node. 
In case there are several moves with the highest valuation, she chooses 
one of them at random. 

As a revision rule we adopt the simple memoryless revision: after 
each round the player revises only the valuation of the moves made in 
the round. The valuation of such a move becomes the payoff (0 or 1) 
in that round. 

Equipped with these rules, and an initial valuation, the player can 
play a repeated game. In each round she plays according to the myopic 
strategy, using the current valuation, and at the end of the round she 
revises her valuation according to the memoryless revision. 

This learning process, together with the strategies of the other players 
in the repeated game, induces a probability distribution over the 
infinite histories of the repeated game. We show the following, with 
respect to this probability. 

Suppose that the learning player can guarantee a win in the 
stage game. If she plays according to the myopic strategy and 
the memoryless revision rules, then starting with any nonnegative 
valuation, there exists, with probability 1, a time 
after which the player always wins. 

2 There is no obvious way to define an assessment for a strategy from a system 
of node valuations. Therefore, a simple translation of our learning model in terms 
of strategies is not straightforward. One fundamental difficulty is that the node 
valuation treatment does not impose that a strategy be assessed in the same way 
throughout the play of the game. Also, two strategies involving the same first move 
should be assessed in the same way initially (a condition which does not make much 
sense in the reinforcement learning based on the strategic form). 

When the learning player has more than two payoffs, the previous 
learning process is of no help. In this case we study the exploratory 
myopic strategy rule, by which the player opts for the maximally val- 
ued move, but chooses also, with small probability, moves that do not 
maximize the valuation. 

The introduction of such perturbations makes it necessary to strengthen 
the revision rule. We consider the averaging revision. Like the mem- 
oryless revision, the player revises only the valuation of moves made 
in the last round. The valuation of such a move is the average of the 
payoffs in all previous rounds in which this move was made. 

If the learning player obeys the exploratory myopic strategy 
and the averaging revision rules, then starting with any val- 
uation, there exists, with probability 1, a time after which 
the player's payoff is close to her individually rational payoff 
(the maxmin payoff) in the stage game. 

The two previous results indicate that reinforcement learning achieves 
learning of playing the stage game itself, rather than playing against 
certain opponents. The learning processes described guarantee the 
player her individually rational payoff (which is the win in the first 
result). This is exactly the payoff that she can guarantee even when 
the other players are disregarded. 

Our next result concerns the case where all the players learn the 
stage game. By the previous result we know that each can guarantee 
his individually rational payoff. But, it turns out that the synergy of 
the learning processes yields the players more than just learning the 
stage game. Indeed, they learn in this case each other's behavior and 
act rationally on this information. 

Suppose the stage game has a unique perfect equilibrium. If 
all the players employ the exploratory myopic strategy and 
the averaging revision rules, then starting with any valuation, 
with probability 1, there is a time after which their strategy 
in the stage game is close to the perfect equilibrium. 

Although valuation is defined for all moves, the learning player needs 
no information concerning the game when she starts playing it. Indeed, 
the initial valuation can be constant. To play the stage game with 
this valuation, the player needs to know which moves are possible to 
her, only when it is her turn to play, and then choose one of them at 
random. During the repeated game, the player should be able to record 
the moves she made and their valuations. Still, the learning procedure 
does not require that the player knows how many players there are, let 
alone the moves they can make and their payoffs. 

The learning processes discussed here treat separately the valuation 
for every node. For games with a large number of nodes (or states of 
the board), that may be unrealistic because the chance of meeting a 
given node several times is too small. In chess, for example, almost 
any state of the board, except for the few first ones, has been seen 
in recorded history only once. In order to make these processes more 
practical, similar moves (or states of the board) should be grouped 
together, such that the number of similarity classes is manageable. 
When the valuation of a move is revised, so are all the moves similar 
to it. We will deal with such learning processes, as well as with games 
with incomplete information, in a later paper. 

2. Preliminaries 

2.1. Games and super games. Consider a finite game G with complete 
information and a finite set of players $I$. The game is described 
by a tree $(Z, N, r, A)$, where $Z$ and $N$ are the sets of terminal and 
non-terminal nodes, correspondingly, the root of the tree is $r$, and the set 
of arcs is $A$. Elements of $A$ are ordered pairs $(n, m)$, where $m$ is an 
immediate successor of $n$. 

The set $N_i$, for $i \in I$, is the set of nodes in which it is i's turn to 
play. The sets $N_i$ form a partition of $N$. The moves of player i at 
node $n \in N_i$ are the nodes in $M_i(n) = \{m \mid (n, m) \in A\}$. Denote 
$M_i = \bigcup_{n \in N_i} M_i(n)$. For each i the function $f_i \colon Z \to \mathbb{R}$ is i's payoff 
function. The depth of the game is the length of the longest path in 
the tree. A game with depth 0 is one in which $\{r\} = Z$ and $N = \emptyset$. 

A behavioral strategy (strategy for short) for player i is a function 
$\sigma_i$ defined on $N_i$, such that for each $n \in N_i$, $\sigma_i(n)$ is a probability 
distribution on $M_i(n)$. 

The super game $\Gamma$ is the infinitely repeated game, with stage game 
G. An infinite history in $\Gamma$ is an element of $Z^{\omega}$. A finite history of t 
rounds, for $t \geq 0$, is an element of $Z^t$. A super strategy for player i in 
$\Gamma$ is a function $\Sigma_i$ on finite histories, such that for $h \in Z^t$, $\Sigma_i(h)$ is a 
strategy of i in G, played in round $t+1$. The super strategy $\Sigma = (\Sigma_i)_{i \in I}$ 
induces a probability distribution on histories in the usual way. 
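
The tree notation of this subsection can be illustrated with a small data structure. This is only a sketch: the class name `GameTree`, its field names, and the string node labels are assumptions made for the illustration and do not appear in the paper. The sample tree is the two-stage game that the text of Example 1 describes, with only player 1's payoffs recorded.

```python
from dataclasses import dataclass


@dataclass
class GameTree:
    root: str          # the root r
    arcs: set          # ordered pairs (n, m): m is an immediate successor of n
    player_at: dict    # non-terminal node -> player whose turn it is
    payoffs: dict      # terminal node -> {player: payoff}

    def moves(self, n):
        """M(n): the immediate successors of node n, in sorted order."""
        return sorted(m for (parent, m) in self.arcs if parent == n)

    def depth(self, n=None):
        """Length of the longest path from n (default: the root) to a terminal node."""
        n = self.root if n is None else n
        successors = self.moves(n)
        if not successors:
            return 0
        return 1 + max(self.depth(m) for m in successors)


# Hypothetical labels: player 1 moves at r, player 2 moves at L.
example = GameTree(
    root="r",
    arcs={("r", "L"), ("r", "R"), ("L", "a"), ("L", "b")},
    player_at={"r": 1, "L": 2},
    payoffs={"R": {1: 1}, "a": {1: 1}, "b": {1: 0}},
)
```

As in the text, a move is identified with the node it leads to, so `example.moves("r")` plays the role of $M_1(r)$.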

2.2. Valuations. We fix one player i (the learning player) and omit 
subscripts of this player when the context allows it. We first introduce 
the basic notions of playing by valuation. A valuation for player i is a 
function $v \colon M_i \to \mathbb{R}$. 

Playing the repeated game $\Gamma$ by valuation requires two rules that 
describe how the stage game G is played for a given valuation, and 
how a valuation is revised after playing G. 

• A strategy rule is a function $v \to \sigma^v$. When player i's valuation is 
$v$, i's strategy in G is $\sigma^v$. 

• A revision rule is a function $(v, h) \to v^h$, such that for the empty 
history $\Lambda$, $v^{\Lambda} = v$. When player i's initial valuation is $v$, then 
after a history of plays $h$, i's valuation is $v^h$. 

Definition 1. The valuation super strategy for player i, induced by 
a strategy rule $v \to \sigma^v$, a revision rule $(v, h) \to v^h$, and an initial 
valuation $v$, is the super strategy $\Sigma_i^v$ which is defined by $\Sigma_i^v(h) = \sigma^{v^h}$ 
for each finite history $h$. 



3. Main results 

3.1. Win-lose games. We consider first the case where player i has 
two possible payoffs in G, which are, without loss of generality, 1 (win) 
and 0 (lose). A two-person win-lose game is a special case, but here we 
place no restrictions on the number of players or their payoffs. 

We assume that learning by valuation is induced by a strategy rule 
and a revision rule of a simple form. 

The myopic strategy rule. This rule associates with each valuation 
$v$ the strategy $\sigma^v$, where for each node $n \in N_i$, $\sigma^v(n)$ is the uniform 
distribution over the maximizers of $v$ on $M_i(n)$. That is, in each node 
of player i, the player selects at random one of the moves with the 
highest valuation. 

The memoryless revision rule. For a history $h = (z)$ of length 1, 
the valuation $v$ is revised to $v^z$, which is defined for each node $m \in M_i$ 
by 

$$v^z(m) = \begin{cases} f_i(z) & \text{if } m \text{ is on the path leading from } r \text{ to } z, \\ v(m) & \text{otherwise.} \end{cases}$$

For a history $h = (z_1, \dots, z_t)$, the current valuation is revised in each 
round according to the terminal node observed in this round. Thus, 

$$v^{(z_1, \dots, z_t)} = \left(v^{(z_1, \dots, z_{t-1})}\right)^{z_t}.$$

The temporal horizons, future and past, required for these two rules 
are very narrow. Playing the game G, the player takes into consideration 
just her next move. The revision of the valuation after playing 
G depends only on the current valuation, and the result of this play, 
and not on the history of past valuations and plays. In addition, the 
revision is confined only to those moves that were made in the last 
round. 
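
The two rules just defined can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the use of a dictionary from move labels to numbers as the valuation, and the `rng` parameter are all hypothetical.

```python
import random


def myopic_choice(valuation, moves, rng=random):
    """Myopic strategy rule: pick uniformly among the highest-valued moves."""
    best = max(valuation[m] for m in moves)
    return rng.choice([m for m in moves if valuation[m] == best])


def memoryless_revision(valuation, moves_made, payoff):
    """Memoryless revision rule: every move of the player on the path to the
    terminal node gets the round's payoff; all other moves are unchanged."""
    revised = dict(valuation)          # leave the caller's valuation intact
    for m in moves_made:
        revised[m] = payoff
    return revised


# One round at a node with two moves (labels are hypothetical).
v = {"L": 0, "R": 0}
move = myopic_choice(v, ["L", "R"])    # ties are broken uniformly at random
v_next = memoryless_revision(v, [move], payoff=1)
```

The revision takes only the current valuation and the outcome of the round as inputs, which mirrors the narrow temporal horizon described above.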

Theorem 1. Let G be a game in which player i either wins or loses. 
Assume that player i has a strategy in G that guarantees him a win. 
Then for any initial nonnegative valuation $v$ of i, and super strategies $\Sigma$ 
in $\Gamma$, if $\Sigma_i$ is the valuation super strategy induced by the myopic strategy 
and the memoryless revision rules, then with probability 1, there is a 
time after which i is winning forever. 

The following example demonstrates learning by valuation. 

Example 1. Consider the game in Figure 1, where the payoffs are 
player 1's. 

[Figure 1. Two payoffs] 

Suppose that 1's initial valuation of each of the moves L and R 
is 0. The valuations that will follow can be one of (0, 0), (1, 0), and 
(0, 1), where the first number in each pair is the valuation of L and the 
second of R. (The valuation (1, 1) cannot be reached from any of these 
valuations.) 

We can think of these possible valuations as states in a stochastic 
process. The state (0, 1) is absorbing. Once it is reached, player 1 
is choosing R and being paid 1 forever. When the valuation is (1, 0), 
player 1 goes L. She will keep going L, and winning 1, as long as player 
2 is choosing a. Once player 2 chooses b, the valuation goes back to 
(0, 0). Thus, the only way player 1 can fail to be paid 1 from a certain 
time on is when (0, 0) recurs infinitely many times. But the probability 
of this is 0, as the probability of moving to the absorbing state (0, 1) in 
a round that starts in state (0, 0) is 1/2. 

Note that the theorem does not state that with probability 1 there 
is a time after which player 1's strategy is the one that guarantees him 
payoff 1. Indeed, in this example, if player 2's strategy is always a, 
then there is a probability 1/2 that player 1 will play L forever, which 
is not the strategy that guarantees player 1 the payoff 1. 
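
The stochastic process of Example 1 can be simulated directly. The sketch below assumes the game is as the text describes it (R pays 1; after L, player 2 chooses a, which pays 1, or b, which pays 0) and, as a further assumption not made in the paper, that player 2 plays a with a fixed probability `p_a` in every round. The function name and parameters are hypothetical.

```python
import random


def play_example_1(rounds, p_a, seed=0):
    """Repeat the game of Example 1 with the myopic strategy and the
    memoryless revision rules; return the final valuation and the payoffs."""
    rng = random.Random(seed)
    v = {"L": 0, "R": 0}                       # initial valuation (0, 0)
    payoffs = []
    for _ in range(rounds):
        best = max(v.values())
        move = rng.choice([m for m in ("L", "R") if v[m] == best])
        if move == "R":
            payoff = 1
        else:
            # Assumption: player 2 plays a with probability p_a.
            payoff = 1 if rng.random() < p_a else 0
        v[move] = payoff                       # memoryless revision
        payoffs.append(payoff)
    return v, payoffs


final_v, history = play_example_1(rounds=2000, p_a=0.5, seed=1)
```

With `p_a = 1.0` every round pays 1 even though player 1 may keep choosing L, which is the situation described in the last paragraph of the example.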
3.2. The case of payoff function with more than two values. 

We now turn to the case in which payoff functions take more than two 
values. The next example shows that in this case the myopic strategy 
and the memoryless revision rules may lead the player astray. 

Example 2. Player 1 is the only player in the game in Figure 2. 

[Figure 2. More than two payoffs] 

In this game player 1 can guarantee a payoff of 10, and therefore 
we expect a learning process to lead player 1 to this payoff. But, no 
reasonable restriction on the initial valuation can guarantee that the 
learning process induced by the myopic strategy and the memoryless 
revision results in the payoff 10 in the long run. For example, for 
any constant initial valuation, there is a positive probability that the 
valuation (-10, 2) for (L, R) is obtained, which is absorbing. 
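
A short simulation makes the absorbing valuation concrete. The sketch assumes the tree of Example 2 is the one the surrounding text implies: R pays 2, and L leads to a second node of player 1 where a pays 10 and b pays -10. The function name, the default constant initial valuation 0, and the seed handling are assumptions of the sketch.

```python
import random


def play_example_2(rounds, initial=0, seed=0):
    """Myopic strategy with memoryless revision in the one-player game."""
    rng = random.Random(seed)
    v = {m: initial for m in ("L", "R", "a", "b")}

    def pick(moves):
        best = max(v[m] for m in moves)
        return rng.choice([m for m in moves if v[m] == best])

    payoffs = []
    for _ in range(rounds):
        first = pick(("L", "R"))
        if first == "R":
            made, payoff = ["R"], 2
        else:
            second = pick(("a", "b"))
            made, payoff = ["L", second], (10 if second == "a" else -10)
        for m in made:                 # memoryless revision of the moves made
            v[m] = payoff
        payoffs.append(payoff)
    return v, payoffs
```

Starting from the constant valuation 0, a first round of L followed by b sets the valuation of L to -10; R is then chosen and revised to 2, and the process stays at (-10, 2) for (L, R), paying 2 in every later round.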

We cannot state for general payoff functions any theorem analogous 
to Theorem 1 or even a weaker version of this theorem. But something 
meaningful can be stated when all players play the repeated game 
according to the myopic strategy and the memoryless revision rules. 

We say that game G is generic if for every player i and for every pair 
of distinct terminal nodes $z$ and $z'$, we have $f_i(z) \neq f_i(z')$. 

Theorem 2. Let G be a generic game. Assume that each player i 
plays $\Gamma$ according to the myopic strategy rule and uses the memoryless 
revision rule. Then for any initial valuation profile, with probability 1, 
there is a time after which the same terminal node is reached in each 
round. 

The limit plays guaranteed by this theorem depend on the initial 
valuations and have no special structure in general. Moreover, it is 
obvious that for any terminal node there are initial valuations that 
guarantee that this terminal node is reached in all rounds. 

We return, now, to the case where only one player learns by 
reinforcement. In order to prevent a player from being paid an inferior 
payoff forever, as in Example 2, we change the strategy rule. We 
allow for exploratory moves that remind her of all possible payoffs in 
the game, so that she is not stuck in a bad valuation. Assume, then, 
that having a certain valuation, the player opts for the highest valued 
nodes, but still allows for other nodes with a small probability $\delta$. Such 
a rule guarantees that the player in Example 2 will never be stuck in the 
valuation (-10, 2). We introduce formally this new rule. 

The $\delta$-exploratory myopic strategy rule. This rule associates with 
each valuation $v$ the strategy $\sigma_{\delta}^{v}$, where for each node $n \in N_i$, 
$\sigma_{\delta}^{v}(n) = (1 - \delta)\sigma^{v}(n) + \delta\mu(n)$. Here, $\sigma^{v}$ is the strategy associated with $v$ by the 
myopic strategy rule, and $\mu$ is the strategy that uniformly selects one 
of the moves at $n$. 
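
The mixture in this rule can be computed explicitly. The following is a minimal sketch, assuming a valuation stored as a dictionary from move labels to numbers; the function names and the `rng` parameter are hypothetical and not part of the paper.

```python
import random


def exploratory_distribution(valuation, moves, delta):
    """Return {move: probability} for (1 - delta) * myopic + delta * uniform."""
    best = max(valuation[m] for m in moves)
    maximizers = [m for m in moves if valuation[m] == best]
    dist = {}
    for m in moves:
        myopic = 1.0 / len(maximizers) if m in maximizers else 0.0
        dist[m] = (1 - delta) * myopic + delta / len(moves)
    return dist


def exploratory_choice(valuation, moves, delta, rng=random):
    """Draw one move from the delta-exploratory myopic distribution."""
    dist = exploratory_distribution(valuation, moves, delta)
    return rng.choices(list(dist), weights=list(dist.values()))[0]


d = exploratory_distribution({"L": 10, "R": 2}, ["L", "R"], delta=0.1)
```

Because the uniform part also puts weight on the maximizers, every move keeps probability at least delta divided by the number of moves at the node, so no move is ever abandoned for good.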

Unfortunately, adding exploratory moves does not help the player to 
achieve 10 in the long run, as we show now. Assume that the initial 
valuation of a and b is 10 and -10 correspondingly, and the valuation 
of the first two moves is also favorable: (10, 2). We assume now that 
in each of the two nodes player 1 chooses the higher valued node with 
probability $1 - \delta$ and the other with probability $\delta$. The valuation of 
a and b cannot change over time. The valuation of (L, R) forms an 
ergodic Markov chain with the two states {(10, 2), (-10, 2)}. Thus, for 
example, the transition from (10, 2) to itself occurs when 
the player chooses either L and a, with probability $(1 - \delta)^2$, or R, with 
probability $\delta$; these probabilities sum to $1 - \delta + \delta^2$. 

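The following is the transition matrix of this Markov chain. 

$$\begin{array}{c|cc} & (10, 2) & (-10, 2) \\ \hline (10, 2) & 1 - \delta + \delta^2 & \delta - \delta^2 \\ (-10, 2) & \delta - \delta^2 & 1 - \delta + \delta^2 \end{array}$$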

The two states (10, 2) and (-10, 2) are symmetric and therefore the 
stationary probability of each is 1/2. Thus, the player is paid 10 and 
2, half of the time each. 
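
The stationary probabilities can be checked numerically. The sketch below builds the two-state transition matrix given above for a chosen value of delta and iterates the chain from the state (10, 2); the function names and the number of iteration steps are assumptions of the sketch.

```python
def transition_matrix(delta):
    """Rows and columns are ordered as (10, 2), (-10, 2)."""
    stay = 1 - delta + delta ** 2
    switch = delta - delta ** 2
    return [[stay, switch], [switch, stay]]


def stationary(matrix, steps=10_000):
    """Approximate the stationary distribution by repeated multiplication."""
    dist = [1.0, 0.0]                          # start in state (10, 2)
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(2)) for j in range(2)]
    return dist


pi = stationary(transition_matrix(0.1))
```

For any delta strictly between 0 and 1 the off-diagonal entries are positive and equal, so the iteration converges to equal weights on the two states, in line with the symmetry argument.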

Note that the exploratory moves are required because the payoff 
function has more than two values. However, the failure to achieve 
the payoff 10 after introducing the $\delta$-exploratory myopic strategy 
rule is the result of this rule, and has nothing to do with the number of 
values of the payoff function. That is, even in a win-lose game, a player 
who has a winning strategy may fail to guarantee a win in the long run 
by playing according to the rules of $\delta$-exploratory myopic strategy and 
memoryless revision. 

Thus, the introduction of the $\delta$-exploratory myopic strategy rule 
forces us also to strengthen the revision rule as follows. 
The averaging revision rule. For a node $m \in M_i$ and a history 
$h = (z_1, \dots, z_t)$, if the node $m$ was never reached in $h$, then $v^h(m) = v(m)$. 
Else, let $t_1, \dots, t_k$ be the times at which $m$ was reached in $h$; 
then 

$$v^h(m) = \frac{1}{k} \sum_{l=1}^{k} f_i(z_{t_l}).$$


We state, now, that by using little exploration and averaging 
revision, player i can guarantee to be close to his individually rational 
(maxmin) payoff in G. 
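
The averaging revision referred to here can be sketched as follows. The representation is an assumption of the sketch: each round is recorded as the set of the player's own moves on the path played together with her payoff in that round, and the function and parameter names are hypothetical.

```python
def averaging_revision(initial, moves_per_round, payoff_per_round):
    """Valuation after a history: a move that was made at least once gets the
    average payoff over the rounds in which it was made; a move that was
    never made keeps its initial valuation."""
    totals, counts = {}, {}
    for moves, payoff in zip(moves_per_round, payoff_per_round):
        for m in moves:
            totals[m] = totals.get(m, 0) + payoff
            counts[m] = counts.get(m, 0) + 1
    return {
        m: (totals[m] / counts[m] if m in counts else initial[m])
        for m in initial
    }


v0 = {"L": 0, "R": 0, "a": 0, "b": 0}
v3 = averaging_revision(v0, [{"L", "a"}, {"L", "b"}, {"R"}], [10, -10, 2])
```

In this run the move L was made twice, with payoffs 10 and -10, so its valuation is their average, 0, while a, b and R were each made once and carry the payoff of that single round.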

Theorem 3. Let $\Sigma$ be a super strategy such that $\Sigma_i$ is the valuation 
super strategy induced by the $\delta$-exploratory myopic strategy and the 
averaging revision rules. Denote by $P_{\delta}$ the distribution over histories in 
$\Gamma$ induced by $\Sigma$. 

Let $\rho$ be i's individually rational payoff in G. Then for every $\varepsilon > 0$ 
there exists $\delta_0 > 0$ such that for every $0 < \delta < \delta_0$, for $P_{\delta}$-almost all 
infinite histories $h = (z_1, z_2, \dots)$, 

$$\liminf_{t \to \infty} \frac{1}{t} \sum_{l=1}^{t} f_i(z_l) > \rho - \varepsilon.$$


We consider now the case where all players learn to play G, using 
the $\delta$-exploratory myopic strategy and the averaging revision rules. 
We show that in such a case, in the long run, the players' strategy 
in the stage game is close to a perfect equilibrium. We assume for 
simplicity that the game G has a unique perfect equilibrium (which is 
true generically). 

Theorem 4. Assume that G has a unique perfect equilibrium $\beta = (\beta_i)_{i \in I}$. 
Let $\Sigma^{\delta}$ be the super strategy such that for each i, $\Sigma_i^{\delta}$ is the 
valuation super strategy induced by the $\delta$-exploratory myopic strategy 
and the averaging revision rules. 

Let $P_{\delta}$ be the distribution over histories induced by $\Sigma^{\delta}$. Then there 
exists $\delta_0$, such that for all $0 < \delta < \delta_0$, for $P_{\delta}$-almost all infinite 
histories $h = (z_1, \dots, z_t, \dots)$, there exists T, such that for all $t > T$, 
$\sigma_{\delta}^{v_i^{(z_1, \dots, z_t)}}(m) = (1 - \delta)\beta_i(m) + \delta\mu_i(m)$, for each player i and node 
$m \in M_i$. 



4. Proofs 



4.1. Stochastic repeated games. We prove all the theorems by 
induction on the depth of the game tree. For this we need to be able 
to deduce properties of $\Gamma$ from properties of repeated games of stage 
games $G'$ which are subgames of G. This can be more naturally done 
when we consider a wider class of repeated games which we call 
stochastic repeated games. Within this class the repeated game of $G'$ can 
be imbedded in the repeated game of G, thus enabling us to make the 
required deductions. 

Let S be a countable set of states which also includes an end state 
e. We consider a game $\Gamma^S$ in which the game G is played repeatedly. 
Before each round a state from S is selected according to a probability 
distribution which depends on the history of the previous terminal 
nodes and states. When the state e is realized the game ends. The 
selected state is known to the players. The strategy played in each 
round depends on the history of the terminal nodes and states. We 
now describe $\Gamma^S$ formally. 

Histories. The set of infinite histories in $\Gamma^S$ is $H_{\infty} = (S \times Z)^{\omega}$. For 
$t \geq 0$ the set of finite histories of t rounds is $H_t = (S \times Z)^t$, and the 
set of preplay histories of t rounds is $H_t^p = (S \times Z)^t \times S$. Denote 
$H = \bigcup_{t=0}^{\infty} H_t$ and $H^p = \bigcup_{t=0}^{\infty} (H_t \times S)$. The subset of $H^p$ of histories that 
terminate with e is denoted by F. For $h \in H_{\infty}$ and $t \geq 0$ we denote by 
$h_t$ the history in $H_t$ which consists of the first t rounds in $h$. For finite 
and infinite histories $h$ we denote by $\bar{h}$ the sequence of terminal nodes 
in $h$. 

Transition probabilities. For each $h \in H$, $\tau(h)$ is a probability 
distribution on S. For $s \in S$, $\tau(h)(s)$ is the probability of transition to 
state s after history $h$. The probability that the game ends after $h$ is 
$\tau(h)(e)$. 

Super strategies. After t rounds the player observes the history of t 
pairs of a state and a terminal node, and the state that follows them, 
and then plays G. Thus, a super strategy for player i is a function $\Sigma_i$ 
from $H^p \setminus F$ to i's strategies in G. We denote by $\Sigma(h)(z)$ the probability 
of reaching terminal node z when $\Sigma(h)$ is played. 

The super play distribution. The super strategy $\Sigma$ induces the 
super play distribution which is a probability distribution P over 
$H_{\infty} \cup F$. It is the unique extension of the distribution over finite 
histories which satisfies 

(1) $P(h, s) = P(h)\,\tau(h)(s)$ 

for $h \in H$, and 

(2) $P(h, z) = P(h)\,\Sigma(h)(z)$ 

for $h \in H^p$. 
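
Applying (1) and (2) alternately expresses the probability of a finite history as a product. The display below is a worked restatement, under the assumption (not stated explicitly above) that the empty history has probability 1, and for histories in which none of the states $s_1, \dots, s_t$ is the end state e. Here $h_{l-1}$ denotes the first $l - 1$ rounds of the history.

```latex
P(s_1, z_1, \dots, s_t, z_t)
  = \prod_{l=1}^{t} \tau(h_{l-1})(s_l)\, \Sigma(h_{l-1}, s_l)(z_l),
\qquad h_{l-1} = (s_1, z_1, \dots, s_{l-1}, z_{l-1}).
```

Each factor pairs one transition probability with one stage-game probability, which is why specifying $\tau$ and $\Sigma$ determines P on finite histories.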
The valuation super strategy. Player i's valuation super strategy 
in $\Gamma^S$, starting with valuation $v$, is the super strategy $\Sigma_i^v$ which satisfies 

$\Sigma_i^v(h) = \sigma^{v^{\bar{h}}}$. 

4.2. Subgames. We show now how a stochastic repeated game of a 
subgame of G can be imbedded in $\Gamma^S$. 

For a node n in G, denote by $G_n$ the subgame starting at n. Fix a 
super strategy profile $\Sigma$ in $\Gamma^S$ and the induced super play distribution 
P on $H_{\infty}$. In what follows we describe a stochastic super game $\Gamma_n^{S'}$, in 
which the stage game is $G_n$. For this we need to define the state space 
$S'$. We tag histories and states in the game $\Gamma_n^{S'}$, as well as terminal 
nodes in $G_n$. Our purpose in this construction is to imbed $H'_{\infty}$ in $H_{\infty}$. 
The idea is to regard those rounds in a history $h$ in $H_{\infty}$ in which node 
n is not reached as states in $S'$. 

Let $S'$ be defined as the set of all $h \in H^p$, such that node n is never 
reached in $h$. Obviously, $S'$ subsumes S, and in particular includes the 
end state e. Note that the set $H'_{\infty}$ of infinite histories in $\Gamma_n^{S'}$ can be 
naturally viewed as a subset of $H_{\infty}$, $H'$ as a subset of $H$, and $H'^p$ as 
a subset of $H^p$. We use this fact to define the transition probability 
distribution $\tau'$ in $\Gamma_n^{S'}$ as follows. 

For any $s' \neq e$ in $S'$ and $h' \in H'$ with $P(h') > 0$, 

(3) $\tau'(h')(s') = P(h', s' \mid h')\,\Sigma(h', s')(n)$, 

where $\Sigma(h', s')(n)$ is the probability that node n is reached under the 
strategy profile $\Sigma(h', s')$. For e, $\tau'(h')(e) = P(E \mid h')$, where E consists 
of all histories $h \in H_{\infty} \cup F$ with initial segment $h'$ such that n is never 
reached after this initial segment. 

Note that $\tau'(h')(s')$ is the probability of all histories in $H_{\infty} \cup F$ that 
start with $(h', s')$ and are followed by a terminal node of the game $G_n$. 
These events and the event E described above form a partition of 
$H_{\infty} \cup F$, and therefore $\tau'$ is a probability distribution. 

Claim 1. Define a super strategy profile $\Sigma'$ in $\Gamma_n^{S'}$ by 

(4) $\Sigma'(h') = \Sigma_n(h')$ 

for each $h' \in H'^p$, where the right-hand side is the restriction of $\Sigma(h')$ 
to $G_n$. Then, the restriction of P to $H'_{\infty}$ coincides with the super play 
probability distribution $P'$ induced by $\Sigma'$. 

Proof. It is enough to show that P and $P'$ coincide on $H'$. The proof 
is by induction on the length of $h' \in H'$. Suppose $P'(h') = P(h') > 0$ 
and consider the history $(h', s', z')$. Then, by the definition of the super 
play distribution ((1) and (2)), 

$P'(h', s', z') = P'(h')\,\tau'(h')(s')\,\Sigma'(h', s')(z')$. 

By the induction hypothesis and the definition of $\tau'$ in (3), the 
right-hand side is $P(h', s')\,\Sigma(h', s')(n)\,\Sigma'(h', s')(z')$. By the definition of $\Sigma'$ in 
(4), this is just $P(h', s')\,\Sigma(h', s')(n)\,\Sigma_n(h', s')(z')$. This expression, 
in turn, is just $P(h', s')\,\Sigma(h', s')(z') = P(h', s', z')$. ■ 

Next, we note that playing by valuation is inherited by subgames. 

Claim 2. Suppose that i's strategy in $\Gamma^S$, $\Sigma_i$, is the valuation super 
strategy starting with $v$, and using either the myopic strategy and the 
memoryless revision rules, or the $\delta$-exploratory myopic strategy and the 
averaging revision rules. Then the induced strategy in $\Gamma_n^{S'}$, $\Sigma'_i$, is the 
valuation super strategy starting with $v_n$ (the restriction of $v$ to the 
subgame $G_n$) and following the corresponding rules. 

Proof. The valuation super strategy in $\Gamma_n^{S'}$, starting with $v_n$, requires 
that after history $h' \in H'$, strategy $\sigma^{v_n^{\bar{h}'}}$ is played. Here, $\bar{h}'$ is the 
sequence of all terminal nodes in $h'$, which consists of terminal nodes 
in $G_n$. These are also all the terminal nodes of $G_n$ in $h'$, when the 
latter is viewed as a history in $H$. 

When $h'$ is considered as a history in $H$, then the strategy $\Sigma_i(h')$ 
is $\sigma^{v^{\bar{h}}}$, where $\bar{h}$ is the sequence of all terminal nodes in $h'$ viewed 
as a history in $H$. $\Sigma'_i(h')$ is 
the restriction of $\sigma^{v^{\bar{h}}}$ to $G_n$. But along the history $h'$, the valuation 
of nodes in the game $G_n$ does not change in rounds in which terminal 
nodes which are not in $G_n$ are reached. Therefore, $\Sigma'_i(h')$ and $\sigma^{v_n^{\bar{h}'}}$ are 
the same. ■ 

4.3. Win-lose games. The game $\Gamma$ is in particular a stochastic 
repeated game, where there is only one state, besides e, and transition to 
e (that is, termination of the game) has null probability. We prove all 
three theorems for the wider class of stochastic repeated games. The 
theorems can be stated verbatim for this wider class of games, with 
one obvious change: any claim about almost all histories should be 
replaced by a corresponding claim for almost all infinite histories. 

All the theorems are proved by induction on the depth of the game 
G. The proofs for games of depth 0 (that is, games in which payoffs 
are determined in the root, with no moves) are straightforward and 
are omitted. In all the proofs, $R = \{n_1, \dots, n_k\}$ is the set of all the 
immediate successors of the root r. 

Proof of Theorem 1. Assume that the claim of the theorem holds for 
all the subgames of G. We examine first the case that the first player is 
not i. By the stipulation of the theorem, player i can guarantee payoff 
1 in each of the games $G_{n_j}$ for $j = 1, \dots, k$. 

Consider now the game $\Gamma_{n_j}^{S'}$, the super strategy profile $\Sigma'$, and the 
induced super play distribution $P'$. By the induction hypothesis, and 
Claim 2, for each j, for $P'$-almost all infinite histories there is a time 
after which player i is paid 1. In view of Claim 1, for P-almost all 
histories in $\Gamma^S$ in which $n_j$ is reached infinitely many times, there exists 
a time after which player i is paid 1, whenever $n_j$ is reached. Consider 
now a nonempty subset Q of R. Let $E_Q$ be the set of infinite histories in 
$\Gamma^S$ in which node $n_j$ is reached infinitely many times iff $n_j \in Q$. Then, 
for P-almost all histories in $E_Q$ there is a time after which player i is 
paid 1. The events $E_Q$, when Q ranges over all nonempty subsets of R, 
form a partition of the set of all infinite histories, which completes the 
proof in this case. 

Consider now the case that i is the first player in the game. In this 
case there is at least one subgame G nj in which % can guarantee the 
payoff 1. Assume without loss of generality that this holds for j = 1. 

For a history $h$ denote by $R_t^+$ the random variable that takes as
values the subset of the nodes in $R$ that have a positive valuation
after $t$ rounds. When $R_t^+$ is not empty, $i$ chooses at $r$, with
probability 1, one of the nodes in $R_t^+$. As a result, the valuation
of this node after the next round is 0 or 1, while the valuations of
all other nodes do not change. Therefore $R_t^+$ is weakly decreasing
when $R_t^+ \neq \emptyset$. That is,
$P(R_{t+1}^+ \subseteq R_t^+ \mid R_t^+ \neq \emptyset) = 1$.

Let $E^+$ be the event that $R_t^+ = \emptyset$ for only finitely many
$t$'s. Then, for $P$-almost all histories in $E^+$ there exists a time
$T$ such that $R_t^+$ is weakly decreasing for $t > T$. Hence, for
$P$-almost all histories in $E^+$ there is a nonempty subset $R'$ of
$R$ and a time $T$ such that $R_t^+ = R'$ for $t > T$. But in order for
the set of nodes in $R$ with positive valuation not to change after
$T$, player $i$ must be paid 1 in each round after $T$. Thus we only
need to show that $P(\bar{E}^+) = 0$, where $\bar{E}^+$ is the
complement of $E^+$.

Consider the event $E^1$ that $n_1$ is reached in infinitely many
rounds. As proved before by the induction hypothesis, for $P$-almost
all histories in $E^1$ there exists $T$ such that the valuation of
$n_1$ is 1 for each round $t > T$ in which $n_1$ is reached. The
valuation of this node does not change in rounds in which it is not
reached. Thus, $E^1 \subseteq E^+$ $P$-almost surely.

We conclude that for $P$-almost all histories in the complement
$\bar{E}^+$ of $E^+$ there is a time $T$ such that $n_1$ is not reached
after time $T$. But $P$-almost surely, for such histories there are
infinitely many $t$'s in which the valuation of all nodes in $R$ is 0.
In each such round, the probability that $n_1$ is not reached is
$1 - 1/k$, which establishes $P(\bar{E}^+) = 0$. ■
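The second case of the proof can be illustrated numerically. The
following is a minimal sketch, not the authors' construction: it
assumes a depth-1 win-lose game in which the payoff after each root
move is fixed, that all valuations start at 0, and that ties are broken
uniformly. The function name `simulate_simple_revision` and its
parameters are hypothetical.

```python
import random

def simulate_simple_revision(payoffs, rounds, seed=0):
    """Myopic play with simple valuation revision at the root of a
    depth-1 game. payoffs[j] is the assumed fixed payoff (0 or 1)
    obtained after moving to node n_j."""
    rng = random.Random(seed)
    valuations = [0] * len(payoffs)  # assumption: valuations start at 0
    choices = []
    for _ in range(rounds):
        best = max(valuations)
        candidates = [j for j, v in enumerate(valuations) if v == best]
        j = rng.choice(candidates)   # ties are broken uniformly
        valuations[j] = payoffs[j]   # the move made gets the payoff of the play
        choices.append(j)
    return choices, valuations

# Node 0 plays the role of n_1, the move after which the player wins.
choices, valuations = simulate_simple_revision([1, 0, 0], rounds=200)
first_win = choices.index(0)
print("n_1 first chosen in round", first_win + 1, "valuations:", valuations)
```

Under these assumptions, as long as all valuations are 0 each node is
drawn with probability 1/k, and once the winning node is drawn its
valuation becomes 1 and it is chosen in every later round.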

Proof of Theorem 2. Let $i$ be the player at the root of $G$. By
the induction hypothesis and Claim 1, for each of the supergames
$\Gamma_{n_j}$, $j = 1, \ldots, k$, for $P'$-almost all infinite
histories in this supergame, there is a time after which the same
terminal node is reached. By Claim 2, for $P$-almost all histories of
$\Gamma$ in which $n_j$ recurs infinitely many times there is a time
after which $i$'s valuation of this node is constantly the payoff of
the same terminal node of $G_{n_j}$.

It is enough to show that for $P$-almost all infinite histories in
$\Gamma^s$, there is a time after which the same node from $R$ is
selected with probability 1 at the root. Suppose that this is not the
case. Then there must be a set of histories $E$ with $P(E) > 0$, two
nodes $n_j$ and $n_l$, and two terminal nodes $z_j$ and $z_l$ in
$G_{n_j}$ and $G_{n_l}$ respectively, that recur infinitely many times
in this set. Therefore, for $P$-almost all histories in $E$, there is a
time after which $i$'s valuations of $n_j$ and $n_l$ are $f_i(z_j)$ and
$f_i(z_l)$. Since $G$ is generic, we may assume that
$f_i(z_j) > f_i(z_l)$. Thus, for $P$-almost all histories in $E$, there
is a time after which the conditional probability of $n_l$ given the
history is 0, which is a contradiction. ■

4.4. The case of payoff functions with more than two values. 

We prove Theorem 4 for stochastic repeated games, where the conclusion
of the theorem holds for $P_\delta$-almost all infinite histories.

Proof of Theorem 4. Assume that the claim holds for all the
subgames of $G$. We denote by $\rho_j$ $i$'s individually rational
(maxmin) payoff in $G_{n_j}$.

We denote by $\bar f^t(h)$ $i$'s average payoff at time $t$ in history
$h$. Fix a subgame $G_{n_j}$. Histories in the game $\Gamma_{n_j}$ are
tagged with a prime. Thus, $\bar f^t(h')$ is $i$'s average payoff at
time $t$ in history $h'$ in $\Gamma_{n_j}$.

Let $h$ be a history in $\Gamma$ in which $n_j$ recurs infinitely many
times, at $t_1, t_2, \ldots$. Let $h = (z_1, z_2, \ldots)$. Denote by
$\bar f_j^t(h)$ $i$'s average payoff until $t$ at the times $n_j$ was
reached, that is,

    $\bar f_j^t(h) = \frac{1}{|\{l : t_l \le t\}|} \sum_{l:\, t_l \le t} f_i(z_{t_l}).$

The history $h$ can be viewed as an infinite history $h'$ in
$\Gamma_{n_j}$. Moreover, for each $l$,
$\bar f^l(h') = \bar f_j^{t_l}(h)$. By the definition of
$\bar f_j^t(h)$, it follows that if there exists $L$ such that for each
$l > L$, $\bar f^l(h') > \rho_j - \varepsilon$, then there exists $T$
such that for each $t > T$, $\bar f_j^t(h) > \rho_j - \varepsilon$. By
the induction hypothesis there is $\delta_0$ such that for all
$0 < \delta < \delta_0$, for $P'_\delta$-almost all histories $h'$
there exists such an $L$. Thus, by Claims 1 and 2, there exists
$\delta_0$ such that for all $j$ and $0 < \delta < \delta_0$, for
$P_\delta$-almost all histories $h$ in $\Gamma^s$ in which $n_j$ recurs
infinitely many times, there exists a time $T$ such that for each
$t > T$, $\bar f_j^t(h) > \rho_j - \varepsilon$.

We examine first the case that the first player is not $i$. Obviously,
in this case, $\rho = \min_j \rho_j$.

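Let $Q$ be a nonempty subset of $R$, and let $E_Q$ be the set of all
infinite histories in which the set of nodes that recur infinitely many
times is $Q$. Consider a history $h$ in $E_Q$, with
$h = (z_1, z_2, \ldots)$. Let $\nu_j^t(h)$ be the number of times $n_j$
is reached in $h$ until time $t$. Then,

    $\bar f^t(h) = \frac{1}{t} \sum_{j=1}^{k} \nu_j^t(h)\, \bar f_j^t(h) \ge \min_{j:\, n_j \in Q} \bar f_j^t(h),$

where the inequality holds because $\sum_{j=1}^{k} \nu_j^t(h) = t$ and,
for $n_j \notin Q$, $\nu_j^t(h) = 0$. Thus for $P_\delta$-almost all
histories $h$ in $E_Q$,

$$
\begin{aligned}
\lim_{t\to\infty} \bar f^t(h)
  &\ge \lim_{t\to\infty} \min_{j:\, n_j \in Q} \bar f_j^t(h) \\
  &\ge \min_{j:\, n_j \in Q} \lim_{t\to\infty} \bar f_j^t(h) \\
  &\ge \min_{j:\, n_j \in Q} \rho_j - \varepsilon \\
  &\ge \rho - \varepsilon.
\end{aligned}
$$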

Since this is true for all $Q$, the conclusion of the theorem follows
for $P_\delta$-almost all infinite histories.

Next, we examine the case that $i$ is the first player. Note that in
this case, for each node $n_j$, $\bar f_j^t(h) = v^{h_t}(n_j)$. Observe
also that for $P_\delta$-almost all infinite histories $h$ in
$\Gamma^s$, each of the subgames $G_{n_j}$ recurs infinitely many times
in $h$. Indeed, after each finite history, each of the games $G_{n_j}$
is selected by $i$ with probability at least $\delta$. Thus, the event
that one of these games is played only finitely many times has
probability 0.

Let $X_t$ be a binary random variable over histories such that
$X_t(h) = 1$ for histories $h$ in which the node $n_{j_0}$ selected by
player $i$ at time $t$ satisfies

(5)  $v^{h_t}(n_{j_0}) > \rho - \varepsilon/2,$

and $X_t(h) = 0$ otherwise.

Claim 3. There exists $\delta_0$ such that for all $j = 1, \ldots, k$
and any $0 < \delta < \delta_0$, for $P_\delta$-almost all infinite
histories $h$ in $\Gamma^s$ there is a time $T$ such that for all
$t > T$,

(6)  $v^{h_t}(n_j) > \rho_j - \varepsilon/4,$

(7)  $|v^{h'_{t+1}}(n_j) - v^{h_t}(n_j)| < \varepsilon/4$

for each history $h'$ such that $h'_t = h_t$, and

(8)  $E_\delta(X_{t+1} \mid h_t) > 1 - \delta,$

where $E_\delta$ is the expectation with respect to $P_\delta$.

The inequality (6) follows from the induction hypothesis. For (7),
note that if $n_j$ is not reached in round $t + 1$ then the difference
in (7) is 0. If $n_j$ is reached then
$v^{h'_{t+1}}(n_j) = (\nu\, v^{h_t}(n_j) + f(z_{t+1}))/(\nu + 1)$, where
$\nu$ is the number of times $n_j$ was reached in $h_t$ and
$f(z_{t+1})$ is the payoff in round $t + 1$. But $\nu$ goes to infinity
with $t$, and thus (7) holds for large enough $t$.
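The step from the averaging update to (7) can be spelled out. The
sketch below assumes that all payoffs in $G$ lie in an interval
$[M, \overline{M}]$, where $M$ is the minimal payoff and $\overline{M}$
(notation introduced here only for illustration) is the maximal one, so
that a valuation formed as an average of payoffs lies in the same
interval.

```latex
\begin{align*}
v^{h'_{t+1}}(n_j) - v^{h_t}(n_j)
  &= \frac{\nu\, v^{h_t}(n_j) + f(z_{t+1})}{\nu + 1} - v^{h_t}(n_j)
   = \frac{f(z_{t+1}) - v^{h_t}(n_j)}{\nu + 1}, \\
\bigl| v^{h'_{t+1}}(n_j) - v^{h_t}(n_j) \bigr|
  &\le \frac{\overline{M} - M}{\nu + 1}
   < \frac{\varepsilon}{4}
  \quad \text{whenever } \nu + 1 > \frac{4(\overline{M} - M)}{\varepsilon}.
\end{align*}
```

Under this assumption a single round can move the average by at most
the payoff range divided by the number of visits, which is why the
bound holds once the node has been reached often enough.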

For (8), observe that (6) implies
$\max_j v^{h_t}(n_j) > \rho - \varepsilon/4$, as
$\rho = \max_j \rho_j$. Then, by (7),
$\max_j v^{h'_{t+1}}(n_j) > \rho - \varepsilon/2$ for each history $h'$
such that $h'_t = h_t$. Therefore, after $h_t$, player $i$ chooses,
with probability at least $1 - \delta$, a node $n_j$ that satisfies
(5), which shows (8).

The information about the conditional expectations in (8) has a simple
implication for the averages of $X_t$. To see it we use the following
convergence theorem from Loève (1963), p. 387.

Stability Theorem. Let $X_t$ be a sequence of random variables with
variance $\sigma_t^2$. If

(9)  $\sum_{t=1}^{\infty} \sigma_t^2 / t^2 < \infty,$

then

(10)  $\bar X_t - \frac{1}{t} \sum_{l=1}^{t} E(X_l \mid X_1, \ldots, X_{l-1}) \to 0$

almost surely, where $\bar X_t = (1/t) \sum_{l=1}^{t} X_l$.

Consider now the restriction of the random variables $X_t$ to the set
of infinite histories, with $P_\delta$ conditioned on this space. From
(8) it follows that on this space, almost surely
$\lim_{t\to\infty} \frac{1}{t} \sum_{l=1}^{t} E_\delta(X_l \mid h_{l-1}) \ge 1 - \delta$.
Therefore, almost surely
$\lim_{t\to\infty} \frac{1}{t} \sum_{l=1}^{t} E_\delta(X_l \mid X_1, \ldots, X_{l-1}) \ge 1 - \delta$.
This is so because the field generated by the random variables
$(X_1, \ldots, X_{l-1})$ is coarser than the field generated by the
histories $h_{l-1}$. Since condition (9) holds for $X_t$, it follows by
the Stability Theorem that for $P_\delta$-almost all infinite histories
$h$,

(11)  $\lim_{t\to\infty} \bar X_t \ge 1 - \delta.$
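The claim that condition (9) holds for the binary variables $X_t$ can
be checked directly. The short calculation below uses only the fact
that a random variable with values in $\{0, 1\}$ has variance at most
$1/4$.

```latex
\begin{align*}
\sigma_t^2 = \Pr(X_t = 1)\bigl(1 - \Pr(X_t = 1)\bigr) \le \tfrac{1}{4}
\quad\Longrightarrow\quad
\sum_{t=1}^{\infty} \frac{\sigma_t^2}{t^2}
  \le \frac{1}{4} \sum_{t=1}^{\infty} \frac{1}{t^2}
  = \frac{\pi^2}{24} < \infty .
\end{align*}
```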
By the definition of $X_t$,

    $\bar f^t(h) = \frac{1}{t} \sum_{j=1}^{k} \nu_j^t(h)\, v^{h_t}(n_j) \ge \bar X_t(h)(\rho - \varepsilon/2) + (1 - \bar X_t(h)) M,$

where $M$ is the minimal payoff in $G$. If we choose $\delta_0$ such
that $(1 - \delta_0)(\rho - \varepsilon/2) + \delta_0 M > \rho - \varepsilon$,
then by (11), for each $\delta < \delta_0$,
$\lim_{t\to\infty} \bar f^t(h) > \rho - \varepsilon$ for
$P_\delta$-almost all infinite histories. ■
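The condition on $\delta_0$ used at the end of the proof can be solved
explicitly. The rearrangement below is a worked sketch; the numerical
values in the last line ($\rho = 1$, $M = 0$, $\varepsilon = 0.1$) are
illustrative assumptions and do not come from the text.

```latex
\begin{align*}
(1-\delta_0)\Bigl(\rho - \tfrac{\varepsilon}{2}\Bigr) + \delta_0 M > \rho - \varepsilon
  &\iff \delta_0 \Bigl(\rho - \tfrac{\varepsilon}{2} - M\Bigr) < \tfrac{\varepsilon}{2}, \\
\text{so, if } \rho - \tfrac{\varepsilon}{2} > M, \quad
  \delta_0 &< \frac{\varepsilon}{2(\rho - M) - \varepsilon}, \\
\text{e.g. } \rho = 1,\ M = 0,\ \varepsilon = 0.1: \quad
  \delta_0 &< \frac{0.1}{1.9} \approx 0.0526 .
\end{align*}
```

If $\rho - \varepsilon/2 \le M$, the left-hand inequality holds for
every $\delta_0$ in $(0, 1)$, so no restriction is needed in that case.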

The proof of Theorem 3 is also extended to stochastic repeated
games. We show that the conclusion of the theorem holds for
$P_\delta$-almost all infinite histories.

Proof of Theorem 3. Assume that the claim of the theorem holds for
all the subgames of $G$. We denote by $v_j$ the restriction of the
valuation $v$ to $G_{n_j}$, and by $\beta_{ij}$ $i$'s perfect
equilibrium strategy there, which is also the restriction of $\beta_i$
to this game.

Claim 4. Let $i_0$ be the player at the root, $\pi_j$ be $i_0$'s payoff
in the perfect equilibrium of $G_{n_j}$, and $\varepsilon > 0$.

Then there exists $\delta_0 > 0$ such that for all
$0 < \delta < \delta_0$, node $n_j$, and player $i$, for
$P'_\delta$-almost all infinite histories $h'$ of $\Gamma_{n_j}$ there
exists $T$ such that for all $t > T$,

(12)  $\sigma_i^{h'_t}(m) = (1 - \delta)\beta_{ij}(m) + \delta p(m)$

for each node $m \in M_i$ in $G_{n_j}$, and

(13)  $|E_\delta(f_{i_0}^{t+1} \mid h'_t) - \pi_j| < \varepsilon,$

where $E_\delta$ is the expectation with respect to $P'_\delta$, and
$f_{i_0}^{t+1}$ is $i_0$'s payoff in round $t + 1$.

The equality (12) is the induction hypothesis. Consider a history
$h'_t$ for which (12) holds. In the round that follows $h'_t$, the
perfect equilibrium path in $G_{n_j}$ is played with probability at
least $(1 - \delta)^{d-1}$, where $d$ is the depth of $G$. Player
$i_0$'s payoff in this path is $\pi_j$. Thus for small enough
$\delta_0$, (13) holds.

By Claims 1 and 2 it follows from (12) that for
$0 < \delta < \delta_0$, for $P_\delta$-almost all histories $h$ in
$\Gamma$, there exists $T$ such that for all $t > T$ the strategies
played in each of the games $\Gamma_{n_j}$ are those of the perfect
equilibrium of $G_{n_j}$. Thus, to complete the proof it is enough to
show that in addition, at the root, $i_0$ chooses in these rounds, with
probability $1 - \delta$, the node $n_{j_0}$ for which
$\beta_{i_0}(r) = n_{j_0}$. For this we need to show that $i_0$'s
valuation of $n_{j_0}$ is higher than the valuation of all other nodes
$n_j$.

To show it, let $3\varepsilon$ be the difference between $\pi_{j_0}$
and the second highest of the payoffs $\pi_j$. By the assumption of the
uniqueness of the perfect equilibrium, $\varepsilon > 0$. Note that as
all players' strategies are fixed for $t > T$,
$\lim_{t\to\infty} \frac{1}{t} \sum_{l=1}^{t} E_\delta(f_{i_0}^{l} \mid h'_{l-1})$
exists. Using the Stability Theorem, as in Theorem 4, we conclude that
$\lim_{t\to\infty} \bar f^t(h')$ exists, and by (13) the inequality
$|\lim_{t\to\infty} \bar f^t(h') - \pi_j| \le \varepsilon$ holds, where
$\bar f^t(h')$ is $i_0$'s average payoff until round $t$ of history
$h'$, in the game $\Gamma_{n_j}$.

As in the proof of Theorem 4 it follows that for $P_\delta$-almost all
infinite histories $h$ in $\Gamma$,
$|\lim_{t\to\infty} v^{h_t}(n_j) - \pi_j| \le \varepsilon$. But then,
for $P_\delta$-almost all infinite histories $h$ there exists $T$ such
that for all $t > T$, $v^{h_t}(n_{j_0})$ is the highest valuation of
all the nodes $n_j$. ■
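The behavior at the root described in this proof can be illustrated
with a small simulation. It is only a sketch under strong assumptions
that are not taken from the text: the game has depth 1, so the payoff
after each root move is a fixed number standing in for $\pi_j$; a move
that was never made has valuation 0; and the perturbation picks a root
move uniformly with probability $\delta$. The name
`simulate_averaging` and its parameters are hypothetical.

```python
import random

def simulate_averaging(payoffs, delta, rounds, seed=0):
    """Delta-perturbed myopic play at the root with averaging revision.

    payoffs[j] is the assumed fixed payoff after moving to n_j."""
    rng = random.Random(seed)
    k = len(payoffs)
    totals = [0.0] * k   # sum of payoffs in rounds where n_j was chosen
    counts = [0] * k     # number of rounds where n_j was chosen
    choices = []

    def valuation(j):
        # Average payoff over the rounds in which the move was made;
        # assumption: 0 for a move that was never made.
        return totals[j] / counts[j] if counts[j] else 0.0

    for _ in range(rounds):
        if rng.random() < delta:
            j = rng.randrange(k)  # perturbation: uniform move (assumption)
        else:
            vals = [valuation(m) for m in range(k)]
            best = max(vals)
            j = rng.choice([m for m in range(k) if vals[m] == best])
        totals[j] += payoffs[j]
        counts[j] += 1
        choices.append(j)
    return choices, [valuation(j) for j in range(k)]

# Node 1 carries the highest payoff and plays the role of n_{j_0}.
choices, valuations = simulate_averaging([1.0, 3.0, 2.0], delta=0.1, rounds=5000)
tail = choices[1000:]
print("valuations:", valuations)
print("share of late rounds at the best node:", tail.count(1) / len(tail))
```

In this simplified setting the valuations coincide with the fixed
payoffs as soon as every move has been tried once, and from then on the
best node is chosen in roughly a $1 - \delta$ share of the rounds or
more.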